35 research outputs found

    GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists

    We present GENECODIS, a web-based tool that integrates different sources of information to search for annotations that frequently co-occur in a set of genes and rank them by statistical significance. The analysis of concurrent annotations provides significant information for the biologic interpretation of high-throughput experiments and may outperform the results of standard methods for the functional analysis of gene lists. GENECODIS is publicly available at
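
As a rough illustration of ranking a co-occurring annotation set by significance, the sketch below counts the genes that carry every annotation in a set and computes a one-sided hypergeometric tail probability. The abstract does not say which test GENECODIS uses, so the hypergeometric model, the function names and the toy data are all assumptions made for this example.

```python
from math import comb


def count_genes_with_all(gene_annotations, annotation_set):
    """Count genes whose annotations include every term in annotation_set."""
    return sum(1 for terms in gene_annotations.values() if annotation_set <= terms)


def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the annotation set."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical toy data: gene -> set of annotation labels (made up).
toy_annotations = {
    "g1": {"A", "B"},
    "g2": {"A", "B"},
    "g3": {"A"},
    "g4": {"B"},
    "g5": set(),
    "g6": {"C"},
}
gene_list = ["g1", "g2", "g3"]   # hypothetical input list
pair = {"A", "B"}                # annotation set tested for co-occurrence

annotated_total = count_genes_with_all(toy_annotations, pair)
hits_in_list = count_genes_with_all(
    {g: toy_annotations[g] for g in gene_list}, pair
)
p_value = hypergeom_enrichment_pvalue(
    len(toy_annotations), annotated_total, len(gene_list), hits_in_list
)
print(p_value)
```

In a real analysis the p-values of many candidate annotation sets would also need a multiple-testing correction, which this sketch omits.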

    Discovering semantic features in the literature: a foundation for building functional associations

    BACKGROUND: Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research. RESULTS: We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene/protein classification or can even be combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes. CONCLUSION: The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
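
The idea of representing each gene as an additive combination of non-negative features can be sketched with a plain NMF using the standard multiplicative updates for squared error. This is only a minimal, dependency-free illustration: the abstract does not specify the NMF variant, the initialization or the stopping rule used by the authors, and the toy matrix, function names and parameters below are assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialization (an arbitrary choice).
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical toy matrix: rows could be genes, columns term weights.
v = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
w, h = nmf(v, rank=2)
print(frobenius_error(v, w, h))
```

Each row of `w` gives the non-negative weights of one gene over the features, and each row of `h` describes one feature in terms of the original columns.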

    Integrated analysis of gene expression by association rules discovery

    BACKGROUND: Microarray technology is generating huge amounts of data about the expression level of thousands of genes, or even whole genomes, across different experimental conditions. To extract biological knowledge, and to fully understand such datasets, it is essential to include external biological information about genes and gene products to the analysis of expression data. However, most of the current approaches to analyze microarray datasets are mainly focused on the analysis of experimental data, and external biological information is incorporated as a posterior process. RESULTS: In this study we present a method for the integrative analysis of microarray data based on the Association Rules Discovery data mining technique. The approach integrates gene annotations and expression data to discover intrinsic associations among both data sources based on co-occurrence patterns. We applied the proposed methodology to the analysis of gene expression datasets in which genes were annotated with metabolic pathways, transcriptional regulators and Gene Ontology categories. Automatically extracted associations revealed significant relationships among these gene attributes and expression patterns, where many of them are clearly supported by recently reported work. CONCLUSION: The integration of external biological information and gene expression data can provide insights about the biological processes associated to gene expression programs. In this paper we show that the proposed methodology is able to integrate multiple gene annotations and expression data in the same analytic framework and extract meaningful associations among heterogeneous sources of data. An implementation of the method is included in the Engene software package
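
To make the notion of an association between annotations and expression patterns concrete, the sketch below computes the two classic association-rule measures, support and confidence, for a single rule. It is not the Engene implementation: the per-gene item encoding (annotation labels mixed with discretized expression states), the labels themselves and the function name are assumptions for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Each transaction is the set of items attached to one gene.
    """
    n = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    support = with_both / n
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: one item set per gene (labels are made up).
transactions = [
    {"ribosome_annotation", "cond1:up"},
    {"ribosome_annotation", "cond1:up"},
    {"ribosome_annotation", "cond1:down"},
    {"glycolysis_annotation", "cond1:up"},
]

support, confidence = rule_metrics(
    transactions, {"ribosome_annotation"}, {"cond1:up"}
)
print(support, confidence)
```

A full rule-discovery procedure would enumerate candidate item sets and keep only the rules above chosen support and confidence thresholds. This sketch scores one given rule.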

    Biclustering of gene expression data by non-smooth non-negative matrix factorization

    BACKGROUND: The extended use of microarray technologies has enabled the generation and accumulation of gene expression datasets that contain expression levels of thousands of genes across tens or hundreds of different experimental conditions. One of the major challenges in the analysis of such datasets is to discover local structures composed by sets of genes that show coherent expression patterns across subsets of experimental conditions. These patterns may provide clues about the main biological processes associated to different physiological states. RESULTS: In this work we present a methodology able to cluster genes and conditions highly related in sub-portions of the data. Our approach is based on a new data mining technique, Non-smooth Non-Negative Matrix Factorization (nsNMF), able to identify localized patterns in large datasets. We assessed the potential of this methodology analyzing several synthetic datasets as well as two large and heterogeneous sets of gene expression profiles. In all cases the method was able to identify localized features related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The uncovered structures showed a clear biological meaning in terms of relationships among functional annotations of genes and the phenotypes or physiological states of the associated conditions. CONCLUSION: The proposed approach can be a useful tool to analyze large and heterogeneous gene expression datasets. The method is able to identify complex relationships among genes and conditions that are difficult to identify by standard clustering algorithms

    MARQ: an online tool to mine GEO for experiments with similar or opposite gene expression signatures

    The enormous amount of data available in public gene expression repositories such as Gene Expression Omnibus (GEO) offers an inestimable resource to explore gene expression programs across several organisms and conditions. This information can be used to discover experiments that induce similar or opposite gene expression patterns to a given query, which in turn may lead to the discovery of new relationships among diseases, drugs or pathways, as well as the generation of new hypotheses. In this work, we present MARQ, a web-based application that allows researchers to compare a query set of genes, e.g. a set of over- and under-expressed genes, against a signature database built from GEO datasets for different organisms and platforms. MARQ offers an easy-to-use and integrated environment to mine GEO, in order to identify conditions that induce similar or opposite gene expression patterns to a given experimental condition. MARQ also includes additional functionalities for the exploration of the results, including a meta-analysis pipeline to find genes that are differentially expressed across different experiments. The application is freely available at http://marq.dacya.ucm.es

    Functional Enrichment Analysis of Regulatory Elements

    This work has been partially supported by FEDER/Junta de Andalucia-Consejeria de Economia y Conocimiento (grant CV20-36723), grant PID2020-119032RB-I00, MCIN/AEI/10.13039/501100011033 and FEDER/Junta de Andalucia-Consejeria de Transformacion Economica, Industria, Conocimiento y Universidades (grant P20_00335). Statistical methods for enrichment analysis are important tools to extract biological information from omics experiments. Although these methods have been widely used for the analysis of gene and protein lists, the development of high-throughput technologies for regulatory elements demands dedicated statistical and bioinformatics tools. Here, we present a set of enrichment analysis methods for regulatory elements, including CpG sites, miRNAs, and transcription factors. Statistical significance is determined via a power weighting function for target genes and tested by the Wallenius noncentral hypergeometric distribution model to avoid selection bias. These new methodologies have been applied to the analysis of a set of miRNAs associated with arrhythmia, showing the potential of this tool to extract biological information from a list of regulatory elements. These new methods are available in GeneCodis 4, a web tool able to perform singular and modular enrichment analysis that allows the integration of heterogeneous information.

    sRNAbench and sRNAtoolbox 2022 update: accurate miRNA and sncRNA profiling for model and non-model organisms

    The NCBI Sequence Read Archive currently hosts microRNA sequencing data for over 800 different species, evidencing the existence of a broad taxonomic distribution in the field of small RNA research. Simultaneously, the number of samples per miRNA-seq study continues to increase resulting in a vast amount of data that requires accurate, fast and user-friendly analysis methods. Since the previous release of sRNAtoolbox in 2019, 55 000 sRNAbench jobs have been submitted which has motivated many improvements in its usability and the scope of the underlying annotation database. With this update, users can upload an unlimited number of samples or import them from Google Drive, Dropbox or URLs. Micro- and small RNA profiling can now be carried out using high-confidence Metazoan and plant specific databases, MirGeneDB and PmiREN respectively, together with genome assemblies and libraries from 441 Ensembl species. The new results page includes straightforward sample annotation to allow downstream differential expression analysis with sRNAde. Unassigned reads can also be explored by means of a new tool that performs mapping to microbial references, which can reveal contamination events or biologically meaningful findings as we describe in the example. sRNAtoolbox is available at: https://arn.ugr.es/srnatoolbox/

    bioNMF: a versatile tool for non-negative matrix factorization in biology

    BACKGROUND: In the bioinformatics field, a great deal of interest has been given to the non-negative matrix factorization (NMF) technique, due to its capability of providing new insights and relevant information about the complex latent relationships in experimental data sets. This method, and some of its variants, has been successfully applied to gene expression, sequence analysis, functional characterization of genes and text mining. Although interest in this technique within the bioinformatics community has increased during the last few years, few simple standalone tools are available to perform these types of data analysis in an integrated environment. RESULTS: In this work we propose a versatile and user-friendly tool that implements the NMF methodology in different analysis contexts to support some of the most important reported applications of this new methodology. This includes clustering and biclustering gene expression data, protein sequence analysis, text mining of biomedical literature and sample classification using gene expression. The tool, which is named bioNMF, also contains a user-friendly graphical interface to explore results in an interactive manner and thereby facilitate the exploratory data analysis process. CONCLUSION: bioNMF is a standalone versatile application which does not require any special installation or libraries. It can be used for most of the multiple applications proposed in the bioinformatics field or to support new research using this method. This tool is publicly available at

    A literature-based similarity metric for biological processes

    BACKGROUND: Recent analyses in systems biology pursue the discovery of functional modules within the cell. Recognition of such modules requires the integrative analysis of genome-wide experimental data together with available functional schemes. In this line, methods to bridge the gap between the abstract definitions of cellular processes in current schemes and the interlinked nature of biological networks are required. RESULTS: This work explores the use of the scientific literature to establish potential relationships among cellular processes. To this end we have used a document-based similarity method to compute pair-wise similarities of the biological processes described in the Gene Ontology (GO). The method has been applied to the biological processes annotated for the Saccharomyces cerevisiae genome. We compared our results with similarities obtained with two ontology-based metrics, as well as with gene product annotation relationships. We show that the literature-based metric conserves most direct ontological relationships, while revealing biologically sound similarities that are not obtained using ontology-based metrics and/or genome annotation. CONCLUSION: The scientific literature is a valuable source of information from which to compute similarities among biological processes. The associations discovered by literature analysis are a valuable complement to those encoded in existing functional schemes, and those that arise by genome annotation. These similarities can be used to conveniently map the interlinked structure of cellular processes in a particular organism.

    Analysis of events with b-jets and a pair of leptons of the same charge in pp collisions at √s=8 TeV with the ATLAS detector

    An analysis is presented of events containing jets including at least one b-tagged jet, sizeable missing transverse momentum, and at least two leptons including a pair of the same electric charge, with the scalar sum of the jet and lepton transverse momenta being large. A data sample with an integrated luminosity of 20.3 fb⁻¹ of pp collisions at √s=8 TeV recorded by the ATLAS detector at the Large Hadron Collider is used. Standard Model processes rarely produce these final states, but there are several models of physics beyond the Standard Model that predict an enhanced rate of production of such events; the ones considered here are production of vector-like quarks, enhanced four-top-quark production, pair production of chiral b′-quarks, and production of two positively charged top quarks. Eleven signal regions are defined; subsets of these regions are combined when searching for each class of models. In the three signal regions primarily sensitive to positively charged top quark pair production, the data yield is consistent with the background expectation. There are more data events than expected from background in the set of eight signal regions defined for searching for vector-like quarks and chiral b′-quarks, but the significance of the discrepancy is less than two standard deviations. The discrepancy reaches 2.5 standard deviations in the set of five signal regions defined for searching for four-top-quark production. The results are used to set 95% CL limits on various models.